[VPlan] Enable vectorization of early-exit loops with unit-stride fault-only-first loads#151300
Conversation
Regarding EVL propagation, I just noticed that header mask handling is fixed in #150202. This fix will need to be incorporated accordingly.

✅ With the latest revision this PR passed the C/C++ code formatter.
I think it would be good for reviewers if this patch was split up into several parts, probably roughly in this order:
Thanks for outlining the steps. That's very helpful! I'll follow this order for splitting up the patch.
@@ -7749,6 +7758,12 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
      Builder.insert(VectorPtr);
      Ptr = VectorPtr;
    }
    if (Legal->getSpeculativeLoads().contains(I)) {
      auto *Load = dyn_cast<LoadInst>(I);
      return new VPWidenFFLoadRecipe(*Load, Ptr, Mask, VPIRMetadata(*Load, LVer),
Just making a note for later. I think it would be good to avoid having a dead VPWidenFFLoadRecipe when there's no non-EVL version of vp.load.ff.
Can we instead just have one VPWidenFFLoadRecipe and here pass VF as the EVL argument? optimizeMaskToEVL can then set the EVL from the header mask later.
Updated! After rebasing, this patch supports WidenFFLoad in non-tail-folded mode only. Tail-folding of early-exit loops and FFLoad support in the EVL transform will have to be addressed in a separate patch.
This patch splits out the legality checks from PR #151300, following the landing of PR #128593. It is a step toward supporting vectorization of early-exit loops that contain potentially faulting loads. In this commit, an early-exit loop is considered legal for vectorization if it satisfies the following criteria:
1. It is a read-only loop.
2. All potentially faulting loads are unit-stride, which is the only type currently supported by vp.load.ff.
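For illustration, a strlen-style search loop meets both criteria (a hypothetical example, not taken from this patch's tests): the loop only reads memory, and its single potentially faulting load walks the array with unit stride up to an uncountable early exit.

```cpp
#include <cstddef>

// Early-exit loop that is legal under the criteria above: read-only, with
// one unit-stride load that may fault if `data` is not dereferenceable for
// all `n` elements.
std::size_t findFirstZero(const int *data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    if (data[i] == 0) // single potentially faulting, unit-stride load
      return i;       // uncountable early exit
  }
  return n;
}
```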
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-transforms

Author: Shih-Po Hung (arcbbb)

Changes

Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.

Key changes:
Limitations:
Patch is 34.01 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151300.diff 9 Files Affected:
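As background for the diff below: vp.load.ff loads up to VL consecutive lanes but may stop early at the first non-dereferenceable lane, returning both the loaded data and the number of lanes actually read. A rough scalar model of that contract (illustrative only; "faulting" is approximated here by an explicit count of dereferenceable elements, and the real intrinsic traps if lane 0 itself faults):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Scalar model of fault-only-first load semantics: read up to `vl` lanes
// starting at `addr`, stopping at the first lane that would fault.
// Returns the loaded data and the number of valid lanes.
std::pair<std::vector<int32_t>, unsigned>
loadFFModel(const int32_t *addr, unsigned vl, unsigned derefElems) {
  unsigned n = std::min(vl, derefElems); // lanes that can be read safely
  std::vector<int32_t> data;
  for (unsigned i = 0; i < n; ++i)
    data.push_back(addr[i]);
  return {data, n};
}
```

The second result is what the new recipe exposes as its scalar VL value; lanes past it are poison, which is why the exit condition must only inspect the first VL lanes.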
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 1d3cffa2b61bf..e28d4c45d4ab8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -393,6 +393,12 @@ static cl::opt<bool> EnableEarlyExitVectorization(
cl::desc(
"Enable vectorization of early exit loops with uncountable exits."));
+static cl::opt<bool>
+ EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false),
+ cl::Hidden,
+ cl::desc("Enable vectorization of early-exit "
+ "loops with fault-only-first loads."));
+
static cl::opt<bool> ConsiderRegPressure(
"vectorizer-consider-reg-pressure", cl::init(false), cl::Hidden,
cl::desc("Discard VFs if their register pressure is too high."));
@@ -3507,6 +3513,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
return FixedScalableVFPair::getNone();
}
+ if (!Legal->getPotentiallyFaultingLoads().empty() && UserIC > 1) {
+ reportVectorizationFailure("Auto-vectorization of loops with potentially "
+ "faulting loads is not supported when the "
+ "interleave count is more than 1",
+ "CantInterleaveLoopWithPotentiallyFaultingLoads",
+ ORE, TheLoop);
+ return FixedScalableVFPair::getNone();
+ }
+
ScalarEvolution *SE = PSE.getSE();
ElementCount TC = getSmallConstantTripCount(SE, TheLoop);
unsigned MaxTC = PSE.getSmallConstantMaxTripCount();
@@ -4076,6 +4091,7 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
case VPDef::VPReductionPHISC:
case VPDef::VPInterleaveEVLSC:
case VPDef::VPInterleaveSC:
+ case VPDef::VPWidenFFLoadSC:
case VPDef::VPWidenLoadEVLSC:
case VPDef::VPWidenLoadSC:
case VPDef::VPWidenStoreEVLSC:
@@ -4550,6 +4566,10 @@ LoopVectorizationPlanner::selectInterleaveCount(VPlan &Plan, ElementCount VF,
if (!Legal->isSafeForAnyVectorWidth())
return 1;
+ // No interleaving for potentially faulting loads.
+ if (!Legal->getPotentiallyFaultingLoads().empty())
+ return 1;
+
// We don't attempt to perform interleaving for loops with uncountable early
// exits because the VPInstruction::AnyOf code cannot currently handle
// multiple parts.
@@ -7216,6 +7236,9 @@ DenseMap<const SCEV *, Value *> LoopVectorizationPlanner::executePlan(
// Regions are dissolved after optimizing for VF and UF, which completely
// removes unneeded loop regions first.
VPlanTransforms::dissolveLoopRegions(BestVPlan);
+
+ VPlanTransforms::convertFFLoadEarlyExitToVLStepping(BestVPlan);
+
// Canonicalize EVL loops after regions are dissolved.
VPlanTransforms::canonicalizeEVLLoops(BestVPlan);
VPlanTransforms::materializeBackedgeTakenCount(BestVPlan, VectorPH);
@@ -7598,6 +7621,10 @@ VPRecipeBuilder::tryToWidenMemory(Instruction *I, ArrayRef<VPValue *> Operands,
Builder.insert(VectorPtr);
Ptr = VectorPtr;
}
+ if (Legal->getPotentiallyFaultingLoads().contains(I))
+ return new VPWidenFFLoadRecipe(*cast<LoadInst>(I), Ptr, &Plan.getVF(), Mask,
+ VPIRMetadata(*I, LVer), I->getDebugLoc());
+
if (LoadInst *Load = dyn_cast<LoadInst>(I))
return new VPWidenLoadRecipe(*Load, Ptr, Mask, Consecutive, Reverse,
VPIRMetadata(*Load, LVer), I->getDebugLoc());
@@ -8632,6 +8659,10 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
if (Recipe->getNumDefinedValues() == 1) {
SingleDef->replaceAllUsesWith(Recipe->getVPSingleValue());
Old2New[SingleDef] = Recipe->getVPSingleValue();
+ } else if (isa<VPWidenFFLoadRecipe>(Recipe)) {
+ VPValue *Data = Recipe->getVPValue(0);
+ SingleDef->replaceAllUsesWith(Data);
+ Old2New[SingleDef] = Data;
} else {
assert(Recipe->getNumDefinedValues() == 0 &&
"Unexpected multidef recipe");
@@ -8679,6 +8710,8 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
// Adjust the recipes for any inloop reductions.
adjustRecipesForReductions(Plan, RecipeBuilder, Range.Start);
+ VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(*Plan);
+
// Apply mandatory transformation to handle FP maxnum/minnum reduction with
// NaNs if possible, bail out otherwise.
if (!VPlanTransforms::runPass(VPlanTransforms::handleMaxMinNumReductions,
@@ -9869,7 +9902,14 @@ bool LoopVectorizePass::processLoop(Loop *L) {
return false;
}
- if (!LVL.getPotentiallyFaultingLoads().empty()) {
+ if (EnableEarlyExitWithFFLoads) {
+ if (LVL.getPotentiallyFaultingLoads().size() > 1) {
+ reportVectorizationFailure("Auto-vectorization of loops with more than 1 "
+ "potentially faulting load is not enabled",
+ "MoreThanOnePotentiallyFaultingLoad", ORE, L);
+ return false;
+ }
+ } else if (!LVL.getPotentiallyFaultingLoads().empty()) {
reportVectorizationFailure("Auto-vectorization of loops with potentially "
"faulting load is not supported",
"PotentiallyFaultingLoadsNotSupported", ORE, L);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index f79855f7e2c5f..6e28c95ca601a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -563,6 +563,7 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
case VPRecipeBase::VPInterleaveEVLSC:
case VPRecipeBase::VPInterleaveSC:
case VPRecipeBase::VPIRInstructionSC:
+ case VPRecipeBase::VPWidenFFLoadSC:
case VPRecipeBase::VPWidenLoadEVLSC:
case VPRecipeBase::VPWidenLoadSC:
case VPRecipeBase::VPWidenStoreEVLSC:
@@ -2811,6 +2812,13 @@ class LLVM_ABI_FOR_TEST VPReductionEVLRecipe : public VPReductionRecipe {
ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp,
R.isOrdered(), DL) {}
+ VPReductionEVLRecipe(RecurKind RdxKind, FastMathFlags FMFs, VPValue *ChainOp,
+ VPValue *VecOp, VPValue &EVL, VPValue *CondOp,
+ bool IsOrdered, DebugLoc DL = DebugLoc::getUnknown())
+ : VPReductionRecipe(VPDef::VPReductionEVLSC, RdxKind, FMFs, nullptr,
+ ArrayRef<VPValue *>({ChainOp, VecOp, &EVL}), CondOp,
+ IsOrdered, DL) {}
+
~VPReductionEVLRecipe() override = default;
VPReductionEVLRecipe *clone() override {
@@ -3159,6 +3167,7 @@ class LLVM_ABI_FOR_TEST VPWidenMemoryRecipe : public VPRecipeBase,
static inline bool classof(const VPRecipeBase *R) {
return R->getVPDefID() == VPRecipeBase::VPWidenLoadSC ||
R->getVPDefID() == VPRecipeBase::VPWidenStoreSC ||
+ R->getVPDefID() == VPRecipeBase::VPWidenFFLoadSC ||
R->getVPDefID() == VPRecipeBase::VPWidenLoadEVLSC ||
R->getVPDefID() == VPRecipeBase::VPWidenStoreEVLSC;
}
@@ -3240,6 +3249,42 @@ struct LLVM_ABI_FOR_TEST VPWidenLoadRecipe final : public VPWidenMemoryRecipe,
}
};
+/// A recipe for widening loads using fault-only-first intrinsics.
+/// Produces two results: (1) the loaded data, and (2) the index of the first
+/// non-dereferenceable lane, or VF if all lanes are successfully read.
+struct VPWidenFFLoadRecipe final : public VPWidenMemoryRecipe, public VPValue {
+ VPWidenFFLoadRecipe(LoadInst &Load, VPValue *Addr, VPValue *VF, VPValue *Mask,
+ const VPIRMetadata &Metadata, DebugLoc DL)
+ : VPWidenMemoryRecipe(VPDef::VPWidenFFLoadSC, Load, {Addr, VF},
+ /*Consecutive*/ true, /*Reverse*/ false, Metadata,
+ DL),
+ VPValue(this, &Load) {
+ new VPValue(nullptr, this); // Index of the first lane that faults.
+ setMask(Mask);
+ }
+
+ VP_CLASSOF_IMPL(VPDef::VPWidenFFLoadSC);
+
+ /// Return the VF operand.
+ VPValue *getVF() const { return getOperand(1); }
+ void setVF(VPValue *V) { setOperand(1, V); }
+
+ void execute(VPTransformState &State) override;
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+ /// Print the recipe.
+ void print(raw_ostream &O, const Twine &Indent,
+ VPSlotTracker &SlotTracker) const override;
+#endif
+
+ /// Returns true if the recipe only uses the first lane of operand \p Op.
+ bool onlyFirstLaneUsed(const VPValue *Op) const override {
+ assert(is_contained(operands(), Op) &&
+ "Op must be an operand of the recipe");
+ return Op == getVF() || Op == getAddr();
+ }
+};
+
/// A recipe for widening load operations with vector-predication intrinsics,
/// using the address to load from, the explicit vector length and an optional
/// mask.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index 46ab7712e2671..684dbd25597e3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -188,8 +188,9 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenCallRecipe *R) {
}
Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPWidenMemoryRecipe *R) {
- assert((isa<VPWidenLoadRecipe, VPWidenLoadEVLRecipe>(R)) &&
- "Store recipes should not define any values");
+ assert(
+ (isa<VPWidenLoadRecipe, VPWidenFFLoadRecipe, VPWidenLoadEVLRecipe>(R)) &&
+ "Store recipes should not define any values");
return cast<LoadInst>(&R->getIngredient())->getType();
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 8e9c3db50319f..3da8613a1d3cc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -73,6 +73,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
case VPReductionPHISC:
case VPScalarIVStepsSC:
case VPPredInstPHISC:
+ case VPWidenFFLoadSC:
return false;
case VPBlendSC:
case VPReductionEVLSC:
@@ -107,6 +108,7 @@ bool VPRecipeBase::mayReadFromMemory() const {
return cast<VPInstruction>(this)->opcodeMayReadOrWriteFromMemory();
case VPWidenLoadEVLSC:
case VPWidenLoadSC:
+ case VPWidenFFLoadSC:
return true;
case VPReplicateSC:
return cast<Instruction>(getVPSingleValue()->getUnderlyingValue())
@@ -3409,6 +3411,47 @@ void VPWidenLoadRecipe::print(raw_ostream &O, const Twine &Indent,
}
#endif
+void VPWidenFFLoadRecipe::execute(VPTransformState &State) {
+ Type *ScalarDataTy = getLoadStoreType(&Ingredient);
+ auto *DataTy = VectorType::get(ScalarDataTy, State.VF);
+ const Align Alignment = getLoadStoreAlignment(&Ingredient);
+
+ auto &Builder = State.Builder;
+ State.setDebugLocFrom(getDebugLoc());
+
+ Value *VL = State.get(getVF(), VPLane(0));
+ Type *I32Ty = Builder.getInt32Ty();
+ VL = Builder.CreateZExtOrTrunc(VL, I32Ty);
+ Value *Addr = State.get(getAddr(), true);
+ Value *Mask = nullptr;
+ if (VPValue *VPMask = getMask())
+ Mask = State.get(VPMask);
+ else
+ Mask = Builder.CreateVectorSplat(State.VF, Builder.getTrue());
+ CallInst *NewLI =
+ Builder.CreateIntrinsic(Intrinsic::vp_load_ff, {DataTy, Addr->getType()},
+ {Addr, Mask, VL}, nullptr, "vp.op.load.ff");
+ NewLI->addParamAttr(
+ 0, Attribute::getWithAlignment(NewLI->getContext(), Alignment));
+ applyMetadata(*NewLI);
+ Value *V = cast<Instruction>(Builder.CreateExtractValue(NewLI, 0));
+ Value *NewVL = Builder.CreateExtractValue(NewLI, 1);
+ State.set(getVPValue(0), V);
+ State.set(getVPValue(1), NewVL, /*NeedsScalar=*/true);
+}
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+void VPWidenFFLoadRecipe::print(raw_ostream &O, const Twine &Indent,
+ VPSlotTracker &SlotTracker) const {
+ O << Indent << "WIDEN ";
+ printAsOperand(O, SlotTracker);
+ O << ", ";
+ getVPValue(1)->printAsOperand(O, SlotTracker);
+ O << " = vp.load.ff ";
+ printOperands(O, SlotTracker);
+}
+#endif
+
/// Use all-true mask for reverse rather than actual mask, as it avoids a
/// dependence w/o affecting the result.
static Instruction *createReverseEVL(IRBuilderBase &Builder, Value *Operand,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 1f6b85270607e..7e78cb6ed02ac 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2760,6 +2760,102 @@ void VPlanTransforms::addExplicitVectorLength(
Plan.setUF(1);
}
+void VPlanTransforms::adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan) {
+ VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+ VPWidenFFLoadRecipe *LastFFLoad = nullptr;
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_deep(Plan.getVectorLoopRegion())))
+ for (VPRecipeBase &R : *VPBB)
+ if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) {
+ assert(!LastFFLoad && "Only one FFLoad is supported");
+ LastFFLoad = Load;
+ }
+
+ // Skip if no FFLoad.
+ if (!LastFFLoad)
+ return;
+
+ // Ensure FFLoad does not read past the remainder in the last iteration.
+ // Set AVL to min(VF, remainder).
+ VPBuilder Builder(Header, Header->getFirstNonPhi());
+ VPValue *Remainder = Builder.createNaryOp(
+ Instruction::Sub, {&Plan.getVectorTripCount(), Plan.getCanonicalIV()});
+ VPValue *Cmp =
+ Builder.createICmp(CmpInst::ICMP_ULE, &Plan.getVF(), Remainder);
+ VPValue *AVL = Builder.createSelect(Cmp, &Plan.getVF(), Remainder);
+ LastFFLoad->setVF(AVL);
+
+ // To prevent branch-on-poison, rewrite the early-exit condition to
+ // VPReductionEVLRecipe. Expected pattern here is:
+ // EMIT vp<%alt.exit.cond> = AnyOf
+ // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond>
+ // EMIT branch-on-cond vp<%exit.cond>
+ auto *ExitingLatch = cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getExiting());
+ auto *LatchExitingBr = cast<VPInstruction>(ExitingLatch->getTerminator());
+
+ VPValue *VPAnyOf = nullptr;
+ VPValue *VecOp = nullptr;
+ assert(
+ match(LatchExitingBr,
+ m_BranchOnCond(m_BinaryOr(m_VPValue(VPAnyOf), m_VPValue()))) &&
+ match(VPAnyOf, m_VPInstruction<VPInstruction::AnyOf>(m_VPValue(VecOp))) &&
+ "unexpected exiting sequence in early exit loop");
+
+ VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1);
+ VPValue *Mask = LastFFLoad->getMask();
+ FastMathFlags FMF;
+ auto *I1Ty = Type::getInt1Ty(Plan.getContext());
+ VPValue *VPZero = Plan.getOrAddLiveIn(ConstantInt::get(I1Ty, 0));
+ DebugLoc DL = VPAnyOf->getDefiningRecipe()->getDebugLoc();
+ auto *NewAnyOf =
+ new VPReductionEVLRecipe(RecurKind::Or, FMF, VPZero, VecOp, *OpVPEVLI32,
+ Mask, /*IsOrdered*/ false, DL);
+ NewAnyOf->insertBefore(VPAnyOf->getDefiningRecipe());
+ VPAnyOf->replaceAllUsesWith(NewAnyOf);
+
+ // Using FirstActiveLane in the early-exit block is safe,
+ // exiting conditions guarantees at least one valid lane precedes
+ // any poisoned lanes.
+}
+
+void VPlanTransforms::convertFFLoadEarlyExitToVLStepping(VPlan &Plan) {
+ // Find loop header by locating VPWidenFFLoadRecipe.
+ VPWidenFFLoadRecipe *LastFFLoad = nullptr;
+
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_shallow(Plan.getEntry())))
+ for (VPRecipeBase &R : *VPBB)
+ if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) {
+ assert(!LastFFLoad && "Only one FFLoad is supported");
+ LastFFLoad = Load;
+ }
+
+ // Skip if no FFLoad.
+ if (!LastFFLoad)
+ return;
+
+ VPBasicBlock *HeaderVPBB = LastFFLoad->getParent();
+ // Replace IVStep (VFxUF) with returned VL from FFLoad.
+ auto *CanonicalIV = cast<VPPhi>(&*HeaderVPBB->begin());
+ VPValue *Backedge = CanonicalIV->getIncomingValue(1);
+ assert(match(Backedge, m_c_Add(m_Specific(CanonicalIV),
+ m_Specific(&Plan.getVFxUF()))) &&
+ "Unexpected canonical iv");
+ VPRecipeBase *CanonicalIVIncrement = Backedge->getDefiningRecipe();
+ VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1);
+ VPBuilder Builder(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
+ Builder.setInsertPoint(CanonicalIVIncrement);
+ auto *TC = Plan.getTripCount();
+ Type *CanIVTy = TC->isLiveIn()
+ ? TC->getLiveInIRValue()->getType()
+ : cast<VPExpandSCEVRecipe>(TC)->getSCEV()->getType();
+ auto *I32Ty = Type::getInt32Ty(Plan.getContext());
+ VPValue *OpVPEVL = Builder.createScalarZExtOrTrunc(
+ OpVPEVLI32, CanIVTy, I32Ty, CanonicalIVIncrement->getDebugLoc());
+
+ CanonicalIVIncrement->setOperand(1, OpVPEVL);
+}
+
void VPlanTransforms::canonicalizeEVLLoops(VPlan &Plan) {
// Find EVL loop entries by locating VPEVLBasedIVPHIRecipe.
// There should be only one EVL PHI in the entire plan.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 69452a7e37572..bc5ce3bc43e76 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -269,6 +269,17 @@ struct VPlanTransforms {
/// (branch-on-cond eq AVLNext, 0)
static void canonicalizeEVLLoops(VPlan &Plan);
+ /// Applies to early-exit loops that use FFLoad. FFLoad may yield fewer active
+ /// lanes than VF. To prevent branch-on-poison and over-reads past the vector
+ /// trip count, use the returned VL for both stepping and exit computation.
+ /// Implemented by:
+ /// - adjustFFLoadEarlyExitForPoisonSafety: replace AnyOf with vp.reduce.or over
+ /// the first VL lanes; set AVL = min(VF, remainder).
+ /// - convertFFLoadEarlyExitToVLStepping: after region dissolution, convert
+ /// early-exit loops to variable-length stepping.
+ static void adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan);
+ static void convertFFLoadEarlyExitToVLStepping(VPlan &Plan);
+
/// Lower abstract recipes to concrete ones, that can be codegen'd.
static void convertToConcreteRecipes(VPlan &Plan);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanValue.h b/llvm/lib/Transforms/Vectorize/VPlanValue.h
index 0678bc90ef4b5..b2bc430a09686 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanValue.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanValue.h
@@ -40,6 +40,7 @@ class VPUser;
class VPRecipeBase;
class VPInterleaveBase;
class VPPhiAccessors;
+class VPWidenFFLoadRecipe;
// This is the base class of the VPlan Def/Use graph, used for modeling the data
// flow into, within and out of the VPlan. VPValues can stand for live-ins
@@ -51,6 +52,7 @@ class LLVM_ABI_FOR_TEST VPValue {
friend class VPInterleaveBase;
friend class VPlan;
friend class VPExpressionRecipe;
+ friend class VPWidenFFLoadRecipe;
const unsigned char SubclassID; ///< Subclass identifier (for isa/dyn_cast).
@@ -351,6 +353,7 @@ class VPDef {
VPWidenCastSC,
VPWidenGEPSC,
VPWidenIntrinsicSC,
+ VPWidenFFLoadSC,
VPWidenLoadEVLSC,
VPWidenLoadSC,
VPWidenStoreEVLSC,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 92caa0b4e51d5..70e6e0d006eb6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -166,8 +166,8 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const {
}
return VerifyEVLUse(*R, 2);
})
- .Case<VPWidenLoadEVLRecipe, VPVectorEndPointerRecipe,
- VPInterleaveEVLRecipe>(
+ .Case<VPWidenLoadEVLRecipe, VPWidenFFLoadRecipe,
+ VPVectorEndPointerRecipe, VPInterleaveEVLRecipe>(
[&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); })
.Case<VPInstructionWithType>(
[&](const VPInstructionWithType *S) { return VerifyEVLUse(*S, 0); })
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/find.ll b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll
new file mode 100644
index 0000000000000..f734bd5f53c82
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll
@@ -0...
[truncated]
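Putting the two transforms together, the vectorized loop clamps the requested length to the remainder, computes the exit condition over only the first VL lanes, and steps the induction variable by the returned VL instead of VFxUF. A scalar sketch of that control flow (hypothetical names; assumes a single FF load feeding an any-of exit, and models the load as always returning the full requested length):

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the stepping scheme after the FFLoad transforms:
// - AVL = min(VF, remainder), so the load never reads past the trip count;
// - the exit reduction inspects only the first VL lanes (poison safety);
// - the induction variable advances by the returned VL, not by VF.
std::size_t searchStepping(const int *data, std::size_t tripCount,
                           std::size_t VF) {
  std::size_t iv = 0;
  while (iv < tripCount) {
    std::size_t avl = std::min(VF, tripCount - iv);
    std::size_t vl = avl; // vp.load.ff may return vl <= avl
    for (std::size_t lane = 0; lane < vl; ++lane) // any-of over VL lanes
      if (data[iv + lane] == 0)
        return iv + lane; // early exit
    iv += vl;             // VL-based stepping
  }
  return tripCount;
}
```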
+ Instruction::Sub, {&Plan.getVectorTripCount(), Plan.getCanonicalIV()});
+ VPValue *Cmp =
+ Builder.createICmp(CmpInst::ICMP_ULE, &Plan.getVF(), Remainder);
+ VPValue *AVL = Builder.createSelect(Cmp, &Plan.getVF(), Remainder);
+ LastFFLoad->setVF(AVL);
+
+ // To prevent branch-on-poison, rewrite the early-exit condition to
+ // VPReductionEVLRecipe. Expected pattern here is:
+ // EMIT vp<%alt.exit.cond> = AnyOf
+ // EMIT vp<%exit.cond> = or vp<%alt.exit.cond>, vp<%main.exit.cond>
+ // EMIT branch-on-cond vp<%exit.cond>
+ auto *ExitingLatch = cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getExiting());
+ auto *LatchExitingBr = cast<VPInstruction>(ExitingLatch->getTerminator());
+
+ VPValue *VPAnyOf = nullptr;
+ VPValue *VecOp = nullptr;
+ assert(
+ match(LatchExitingBr,
+ m_BranchOnCond(m_BinaryOr(m_VPValue(VPAnyOf), m_VPValue()))) &&
+ match(VPAnyOf, m_VPInstruction<VPInstruction::AnyOf>(m_VPValue(VecOp))) &&
+ "unexpected exiting sequence in early exit loop");
+
+ VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1);
+ VPValue *Mask = LastFFLoad->getMask();
+ FastMathFlags FMF;
+ auto *I1Ty = Type::getInt1Ty(Plan.getContext());
+ VPValue *VPZero = Plan.getOrAddLiveIn(ConstantInt::get(I1Ty, 0));
+ DebugLoc DL = VPAnyOf->getDefiningRecipe()->getDebugLoc();
+ auto *NewAnyOf =
+ new VPReductionEVLRecipe(RecurKind::Or, FMF, VPZero, VecOp, *OpVPEVLI32,
+ Mask, /*IsOrdered*/ false, DL);
+ NewAnyOf->insertBefore(VPAnyOf->getDefiningRecipe());
+ VPAnyOf->replaceAllUsesWith(NewAnyOf);
+
+ // Using FirstActiveLane in the early-exit block is safe,
+ // exiting conditions guarantees at least one valid lane precedes
+ // any poisoned lanes.
+}
+
+void VPlanTransforms::convertFFLoadEarlyExitToVLStepping(VPlan &Plan) {
+ // Find loop header by locating VPWidenFFLoadRecipe.
+ VPWidenFFLoadRecipe *LastFFLoad = nullptr;
+
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(
+ vp_depth_first_shallow(Plan.getEntry())))
+ for (VPRecipeBase &R : *VPBB)
+ if (auto *Load = dyn_cast<VPWidenFFLoadRecipe>(&R)) {
+ assert(!LastFFLoad && "Only one FFLoad is supported");
+ LastFFLoad = Load;
+ }
+
+ // Skip if no FFLoad.
+ if (!LastFFLoad)
+ return;
+
+ VPBasicBlock *HeaderVPBB = LastFFLoad->getParent();
+ // Replace IVStep (VFxUF) with returned VL from FFLoad.
+ auto *CanonicalIV = cast<VPPhi>(&*HeaderVPBB->begin());
+ VPValue *Backedge = CanonicalIV->getIncomingValue(1);
+ assert(match(Backedge, m_c_Add(m_Specific(CanonicalIV),
+ m_Specific(&Plan.getVFxUF()))) &&
+ "Unexpected canonical iv");
+ VPRecipeBase *CanonicalIVIncrement = Backedge->getDefiningRecipe();
+ VPValue *OpVPEVLI32 = LastFFLoad->getVPValue(1);
+ VPBuilder Builder(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
+ Builder.setInsertPoint(CanonicalIVIncrement);
+ auto *TC = Plan.getTripCount();
+ Type *CanIVTy = TC->isLiveIn()
+ ? TC->getLiveInIRValue()->getType()
+ : cast<VPExpandSCEVRecipe>(TC)->getSCEV()->getType();
+ auto *I32Ty = Type::getInt32Ty(Plan.getContext());
+ VPValue *OpVPEVL = Builder.createScalarZExtOrTrunc(
+ OpVPEVLI32, CanIVTy, I32Ty, CanonicalIVIncrement->getDebugLoc());
+
+ CanonicalIVIncrement->setOperand(1, OpVPEVL);
+}
+
void VPlanTransforms::canonicalizeEVLLoops(VPlan &Plan) {
// Find EVL loop entries by locating VPEVLBasedIVPHIRecipe.
// There should be only one EVL PHI in the entire plan.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 69452a7e37572..bc5ce3bc43e76 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -269,6 +269,17 @@ struct VPlanTransforms {
/// (branch-on-cond eq AVLNext, 0)
static void canonicalizeEVLLoops(VPlan &Plan);
+ /// Applies to early-exit loops that use FFLoad. FFLoad may yield fewer active
+ /// lanes than VF. To prevent branch-on-poison and over-reads past the vector
+ /// trip count, use the returned VL for both stepping and exit computation.
+ /// Implemented by:
+ /// - adjustFFLoadEarlyExitForPoisonSafety: replace AnyOf with vp.reduce.or over
+ /// the first VL lanes; set AVL = min(VF, remainder).
+ /// - convertFFLoadEarlyExitToVLStepping: after region dissolution, convert
+ /// early-exit loops to variable-length stepping.
+ static void adjustFFLoadEarlyExitForPoisonSafety(VPlan &Plan);
+ static void convertFFLoadEarlyExitToVLStepping(VPlan &Plan);
+
/// Lower abstract recipes to concrete ones, that can be codegen'd.
static void convertToConcreteRecipes(VPlan &Plan);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanValue.h b/llvm/lib/Transforms/Vectorize/VPlanValue.h
index 0678bc90ef4b5..b2bc430a09686 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanValue.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanValue.h
@@ -40,6 +40,7 @@ class VPUser;
class VPRecipeBase;
class VPInterleaveBase;
class VPPhiAccessors;
+class VPWidenFFLoadRecipe;
// This is the base class of the VPlan Def/Use graph, used for modeling the data
// flow into, within and out of the VPlan. VPValues can stand for live-ins
@@ -51,6 +52,7 @@ class LLVM_ABI_FOR_TEST VPValue {
friend class VPInterleaveBase;
friend class VPlan;
friend class VPExpressionRecipe;
+ friend class VPWidenFFLoadRecipe;
const unsigned char SubclassID; ///< Subclass identifier (for isa/dyn_cast).
@@ -351,6 +353,7 @@ class VPDef {
VPWidenCastSC,
VPWidenGEPSC,
VPWidenIntrinsicSC,
+ VPWidenFFLoadSC,
VPWidenLoadEVLSC,
VPWidenLoadSC,
VPWidenStoreEVLSC,
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 92caa0b4e51d5..70e6e0d006eb6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -166,8 +166,8 @@ bool VPlanVerifier::verifyEVLRecipe(const VPInstruction &EVL) const {
}
return VerifyEVLUse(*R, 2);
})
- .Case<VPWidenLoadEVLRecipe, VPVectorEndPointerRecipe,
- VPInterleaveEVLRecipe>(
+ .Case<VPWidenLoadEVLRecipe, VPWidenFFLoadRecipe,
+ VPVectorEndPointerRecipe, VPInterleaveEVLRecipe>(
[&](const VPRecipeBase *R) { return VerifyEVLUse(*R, 1); })
.Case<VPInstructionWithType>(
[&](const VPInstructionWithType *S) { return VerifyEVLUse(*S, 0); })
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/find.ll b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll
new file mode 100644
index 0000000000000..f734bd5f53c82
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/find.ll
@@ -0...
[truncated]
Rebased after #152422 landed, and updated the PR description as well as title.
If performance is a concern, I’m considering replacing the active‑lane‑mask + reduce_or sequence with a vfirst intrinsic + icmp in CodeGenPrepare. Would this be preferable?
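A scalar model of the equivalence such a rewrite would rely on. The `vfirst` function here is an assumption modeled on RISC-V's `vfirst.m` (returns the index of the first set mask lane, or -1 when none is set), not an existing API:

```c
#include <assert.h>
#include <stdbool.h>

enum { VF = 8 };

/* AnyOf-style exit check: OR-reduce the exit mask. */
static bool reduce_or(const bool m[VF]) {
  bool any = false;
  for (int i = 0; i < VF; ++i)
    any |= m[i];
  return any;
}

/* Model of RISC-V vfirst.m: index of the first set lane, or -1 if none. */
static int vfirst(const bool m[VF]) {
  for (int i = 0; i < VF; ++i)
    if (m[i])
      return i;
  return -1;
}

/* Exhaustively check (reduce_or m) == (vfirst m >= 0) over all 2^VF masks. */
static bool rewrite_is_sound(void) {
  for (unsigned bits = 0; bits < (1u << VF); ++bits) {
    bool m[VF];
    for (int i = 0; i < VF; ++i)
      m[i] = (bits >> i) & 1;
    if (reduce_or(m) != (vfirst(m) >= 0))
      return false;
  }
  return true;
}
```

Since both forms agree on every mask, the rewrite only changes which instruction sequence the target has to materialize.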
Thanks for flagging early‑exit loops with tail folding. That is also a nice-to-have and I am keen to see it.
If we have an early exit loop with non-dereferenceable loads after the exit, we currently bail:
int z;
for (int i = 0; i < N; i++) {
if (x[i])
break;
z = y[i];
}
If the early exit block dominates the block containing these loads, we can predicate them with a mask like:
for (int i = 0; i < N/VF; i++) {
c[0..VF] = x[i..i+VF]
z[0..VF] = y[i..i+VF], mask=c
if (anyof(c))
break;
}
In VPlan terms, this is `icmp ult step-vector, (first-active-lane exit-cond)`
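A scalar sketch of that masking rule, with hypothetical data; `first_active` plays the role of `first-active-lane`:

```c
#include <assert.h>

enum { VF = 4, N = 8 };

/* Scalar simulation of the predication above: in each vector chunk, the
 * y-load is only performed for lanes strictly before the first lane whose
 * exit condition fired, i.e. lanes where
 * (icmp ult step-vector, (first-active-lane exit-cond)) holds. */
static int last_z_before_exit(const int *x, const int *y) {
  int z = -1;
  for (int i = 0; i < N; i += VF) {
    int first_active = VF; /* first lane where the early exit triggers */
    for (int l = 0; l < VF; ++l)
      if (x[i + l]) {
        first_active = l;
        break;
      }
    for (int l = 0; l < first_active; ++l) /* masked load of y */
      z = y[i + l];
    if (first_active < VF) /* anyof(c): take the early exit */
      break;
  }
  return z;
}
```

This matches the scalar loop's semantics: `z` never observes a `y[i]` at or past the first `x[i]` that triggers the exit.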
VPlanPredicator can handle predicating these blocks, but in tryToBuildVPlanWithVPRecipes we first disconnect all early exits before the masks are introduced:
// entry -> exiting -> ... -> latch
// |
// +-----> earlyexit
VPlanTransforms::handleEarlyExits(*Plan);
// entry -> exiting -> ... -> latch
VPlanTransforms::introduceMasksAndLinearize(*Plan);
This is needed to keep the region single entry/single exit, but it also means that there isn't any control flow by the time we want to add the masks:
exiting:
%earlyexitcond = ...
// one successor (latch)
latch:
%exitcond = or (anyof %earlyexitcond), %origexitcond
br %exitcond, entry, exit
This patch propagates the information to VPlanPredicator that the successors should be predicated even though there isn't actually control flow with a new EarlyExit VPInstruction:
exiting:
%earlyexitcond = ...
earlyexit (icmp ult step-vector, (first-active-lane %earlyexitcond))
// one successor (latch)
latch:
%exitcond = or (anyof %earlyexitcond), %origexitcond
br %exitcond, entry, exit
It's just a placeholder and gets immediately removed whenever VPlanPredicator sees it, but allows it to use the correct mask.
This makes way for supporting more types of loops, as we could also support stores/divs etc. as long as the exiting block dominates them. See the note in canUncountableExitConditionLoadBeMoved as to why we can't predicate stores when they're before the exiting block.
However, the main motivation for this is to allow us to support tail folding with early exits, which I believe will be needed to make supporting fault-only-first loads simpler: llvm#151300
In order to actually test the changes from this, this PR allows non-dereferenceable loads that are properly dominated by the exiting block in LoopVectorizationLegality. But in practice, something else usually transforms these loops to be multiple-entry which prevents them from being vectorized.
"Enable vectorization of early exit loops with uncountable exits."));

static cl::opt<bool>
    EnableEarlyExitWithFFLoads("enable-early-exit-with-ffload", cl::init(false),
Does it work for other targets besides RISCV? If so, the PR should have some tests for other backends too, or even better in the top level Transforms/LoopVectorize directory.
I think we need tests in the top level LoopVectorize directory too. For example, I think it's worth adding an extra RUN line with the -enable-early-exit-with-ffload flag to Transforms/LoopVectorize/single_early_exit.ll and Transforms/LoopVectorize/single_early_exit_live_outs.ll.
Those tests have a very extensive coverage of different CFGs and combinations of live-outs in different scenarios.
- Merge `canonicalizeEVLLoops` and `convertFFLoadEarlyExitToVLStepping` into a single entry point `convertLoopIntoVariableLengthStepping`. Both transforms convert loops to use variable-length stepping after region dissolution.
- Generalize poison-safety handling in `adjustFFLoadEarlyExitForPoisonSafety` to mask all AnyOf users of FFLoad (not just a single hardcoded pattern).
You can test this locally with the following command:
git diff origin/main HEAD -- 'llvm/include/llvm/**/*.h' 'llvm/include/llvm-c/**/*.h' 'llvm/include/llvm/Demangle/**/*.h'
Then run idt on the changed files with appropriate --export-macro and --include-header flags.
View the diff from ids-check here.
diff --git a/llvm/include/llvm/IR/Type.h b/llvm/include/llvm/IR/Type.h
index 1c042500b..6cfccad75 100644
--- a/llvm/include/llvm/IR/Type.h
+++ b/llvm/include/llvm/IR/Type.h
@@ -14,6 +14,7 @@
#ifndef LLVM_IR_TYPE_H
#define LLVM_IR_TYPE_H
+#include "llvm/include/llvm/Support/Compiler.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/CBindingWrapping.h"
#include "llvm/Support/Casting.h"
@@ -234,7 +235,7 @@ public:
bool isTokenTy() const { return getTypeID() == TokenTyID; }
/// Returns true if this is 'token' or a token-like target type.s
- bool isTokenLikeTy() const;
+ LLVM_ABI bool isTokenLikeTy() const;
/// True if this is an instance of IntegerType.
bool isIntegerTy() const { return getTypeID() == IntegerTyID; }
I've opened up some thoughts on fault-only-first loads in #175900, but thought I'd mention some things on this PR that I'd like to hear your thoughts on:
- fault-only-first loads require variable stepping, which you've converted in this PR. But I think we still need to update induction recipes to account for the fact that they no longer step by VF?
@llvm.experimental.get.vector.length has the property that the step amount may vary on the penultimate iteration, but otherwise the vector loop will always be executed the same number of iterations as a non-tail-folded loop, i.e. the backedge-taken count is always the same.
I don't think this is true of @llvm.vp.load.ff. It's valid for a vle*ff.v to reduce VL to one on every iteration, which means the vector trip count may change. Do we need to go through any users of the vector trip count and check whether they need to be adapted?
- We don't predicate any recipes after the
vp.load.ff. I think this happens to work at the moment because we currently don't allow any non-speculatable recipes in early-exit loops, but this won't be the case after #172454.
I think it would be good to handle this logic together in this PR, otherwise it will surprisingly break when more non-speculatable recipes are allowed. I think landing #172454 first would allow us to write test cases for this.
Good catch! I missed this when porting the patch. I'll add a test for it.
Non-tail-folded loops don't need changes because VTC is (N - (N % Step)), and its uses in the exit condition and resume-phi treat it as a cumulative processed count.
I believe #172454 predicates those non-speculatable recipes, so even after it lands, they'll be guarded by masks. We'll still need to gate those masks with the faulting lane.
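A scalar model of that cumulative-count view: the canonical IV accumulates the returned VL, each request is clamped to AVL = min(VF, remainder), and the loop exits once the cumulative count reaches VTC = N - (N % VF). The `vl_returned` callback is a stand-in for whatever VL a fault-only-first load actually produces, not a real API:

```c
#include <assert.h>

enum { VF = 4 };

/* One run of the vector loop: step the IV by the VL the (simulated) FF load
 * returned, clamping each request to AVL = min(VF, remainder) so the final
 * iteration never reads past the vector trip count. */
static long iterations_needed(long N, long (*vl_returned)(long avl)) {
  long VTC = N - (N % VF); /* vector trip count: cumulative element total */
  long iv = 0, iters = 0;
  while (iv < VTC) {
    long remainder = VTC - iv;
    long avl = remainder < VF ? remainder : VF; /* AVL = min(VF, remainder) */
    long vl = vl_returned(avl); /* FF load may return fewer lanes than AVL */
    iv += vl;                   /* variable-length step */
    ++iters;
  }
  return iters;
}

static long full_vl(long avl) { return avl; } /* never reduces VL */
static long one_lane(long avl) { (void)avl; return 1; } /* worst case */
```

This also illustrates the point raised above: with `full_vl` the loop runs VTC/VF times as usual, but a VL-reducing load can make the backedge-taken count larger, so it is no longer a function of VTC and VF alone.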
Thanks for checking, I got my terminology mixed up though. I think I originally meant to ask about the "vector backedge taken count", but that doesn't seem to be a constant that we materialize in the loop vectorizer. My concern is: do we have any behaviour that implicitly relies on the vector backedge taken count always being VTC / VFxUF? We might not have any, but it would be good to check.
#172454 only predicates lanes off past the early exit, but doesn't predicate any lanes that need to be masked off because vp.load.ff reduced the VL.
… stepping This is groundwork for llvm#151300, which aims to support first-faulting loads in non-tail-folded early-exit loops. Per llvm#175900, we need a variable-length stepping transform that can be shared between EVL and non-EVL loops. The idea is to have an EVL-independent counter and transform for tracking the cumulative number of processed elements. This patch renames the existing counter (VPEVLBasedIVPHIRecipe) and transform (canonicalizeEVLLoops) to be EVL-independent: - Rename VPEVLBasedIVPHIRecipe to VPNumProcessedElementsPHIRecipe to reflect its general purpose of tracking processed element count. - In addExplicitVectorLength, switch the loop from up-counting to down-counting by transforming: (branch-on-count CanonicalIVInc, VTC) to: (branch-on-count (sub VTC, CanonicalIVInc), 0) - Rename canonicalizeEVLLoops to convertToVariableLengthStep and update the exit condition from: (branch-on-count (sub VectorTripCount, CanonicalIVInc), 0) -> (branch-on-count (sub TripCount, (add Step, CumulativeIV)), 0) -> (branch-on-count (sub AVL, Step), 0) if AVL is available.
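The up-count to down-count rewrite in this commit message can be modeled in scalar code; assuming the step divides VTC (true for non-tail-folded loops), both exit conditions take the backedge the same number of times:

```c
#include <assert.h>

/* Up-counting form: (branch-on-count CanonicalIVInc, VTC).
 * Assumes step evenly divides VTC, as in a non-tail-folded vector loop. */
static long upcount_iters(long VTC, long step) {
  long iters = 0;
  for (long iv = 0;;) {
    iv += step; /* CanonicalIVInc */
    ++iters;
    if (iv == VTC)
      break;
  }
  return iters;
}

/* Down-counting form: (branch-on-count (sub VTC, CanonicalIVInc), 0). */
static long downcount_iters(long VTC, long step) {
  long iters = 0;
  for (long iv = 0;;) {
    iv += step;
    ++iters;
    if (VTC - iv == 0)
      break;
  }
  return iters;
}
```

The down-counting form is what makes the later variable-length-step transform convenient: the remaining-element count `VTC - iv` is exactly the AVL that a variable step consumes.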
…ipe (#177114) This is groundwork for #151300, which aims to support first-faulting loads in non-tail-folded early-exit loops. Per #175900, we need a variable-length stepping transform that can be shared between EVL and non-EVL loops. The idea is to have an EVL-independent counter and transform for tracking the cumulative number of processed elements. This patch renames the existing counter (VPEVLBasedIVPHIRecipe) and transform (canonicalizeEVLLoops) to be EVL-independent: - Rename VPEVLBasedIVPHIRecipe to VPCurrentIterationRecipe to reflect its general purpose of tracking processed element count. - Rename canonicalizeEVLLoops to convertToVariableLengthStep. This is NFC.
Following #152422, this patch enables auto-vectorization of early-exit loops containing a single potentially faulting, unit-stride load by using the vp.load.ff intrinsic introduced in #128593.
Key changes:
Limitations:
Stacks on top of #165218.
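As an illustration of the class of loops this enables (not a test from the patch), consider a strlen-style search: nothing bounds the induction variable except the data itself, so a plain vector load of a full VF-wide chunk could fault past the terminator, while a vp.load.ff-style load faults only on the first lane and truncates VL instead:

```c
#include <assert.h>
#include <stddef.h>

/* Early-exit loop with a potentially faulting, unit-stride load: s[i] is
 * only known to be dereferenceable up to (and including) the NUL, so
 * speculatively reading s[i..i+VF-1] may cross into an unmapped page.
 * Fault-only-first loads make this legal to vectorize. */
static size_t my_strlen(const char *s) {
  size_t i = 0;
  while (s[i] != '\0')
    ++i;
  return i;
}
```

The early exit (`s[i] == '\0'`) is the uncountable exit; the unit-stride load `s[i]` is the single speculative load this PR allows.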